-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
APK Parser (#1953) #1956
APK Parser (#1953) #1956
Conversation
Thank you very much @tc-wleite! |
I ran a performance test with the 4 UDFRs which were extracted from Android phones used in #1912 tests, comparing total processing time between master branch and this PR (which includes an APK parser):
So, using the APK parser would reduce a bit the processing time of Android UFDRs. |
Hi wladmir
I will review tomorrow your code
could you share your test cases with me?
Em seg., 13 de nov. de 2023 16:34, Wladimir Leite ***@***.***>
escreveu:
… I ran a performance test with the 4 UDFRs which were extracted from
Android phones used in #1912
<#1912> tests, comparing total
processing time between master branch and this PR (which includes an APK
parser):
Since there was a lot of performance changes recently, I repeated the
tests:
# Type Size (GB) Total Items Master Processing Time (s) Processing Time
(s) using the APK Parser Processing Time Reduction
1 UFDR 2.4 57,495 *555* *500* *10%*
2 UFDR 18 118,297 *1,494* *649* *57%*
5 UFDR 33 697,193 *3,553* *3,008* *15%*
6 UFDR 37 1,596,925 *4,342* *3,546* *18%*
So, using the APK parser would reduce a bit the processing time of Android
UFDRs.
—
Reply to this email directly, view it on GitHub
<#1956 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AG247S7HFD6VE6U76ZBCAO3YEJ74VAVCNFSM6AAAAAA6UGUY7CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMBZGA2DKMRXGM>
.
You are receiving this because your review was requested.Message ID:
***@***.***>
|
Sure! |
I just copied a 7-Zip (~14 GB) with some samples to \Temp\Wladimir. |
I'll start a crawl in BSB cases database through the night. |
@patrickdalla I just shared 37K APKs from 87 cases with you. |
Hi @tc-wleite , IPED already have an certificate parser. I extracted the certificate data and enabled CertificateParser in ParsersConfig.xml (not enabled by default) and resulted in something like the image below. This could be a reasonable approach? @lfcnassif I don't remember well why CertificateParser is not enabled by default. What I remember is that the certificate carver carves a lot of certificates, many of them duplicated many times. Can we reenable it? |
Hi @patrickdalla! Just a couple of minor details:
|
CertificateParser. Reenable and makes some little improvements to CertificateParser.
I don't remember either, maybe it caused a performance impact, and since most cases don't need certificate parsing information... Or maybe the impact was caused by the Certificate carver, since it recovers a lot of items. One option would be enabling the parser and keeping the carver disabled by default. By the way, Tika has a Pkcs7Parser, also disabled by default intentionally a long ago, which parses a subset of mimes handled by your parser, and ideally a comparison between them would be good, not sure if it was done. PS: To enable CertificateParser, we should create an issue to report it in the 4.2.0 release notes. |
I have just reviewed Pkcs7Parser code from tika. Pkcs7 is a container spec to hold content and its signature info in same file/stream. Pkcs7Parser of tika only strips/ignores the signature and delegate the content parsing to the corresponding parser. Pkcs7Parser doesn't parse any signature and respectives certification information. Pkcs7 is most used to save certification revogation list and certification files itself (when included with entire certificates of certification path). The CertificateParser uses java.security.cert.CertificateFactory that can extract the certificates these files PKCS7 formatted content. PKCS7 is not the format of the certificate used to sign the APK. It seems from https://issues.apache.org/jira/browse/TIKA-3205, code done after the implementation of CertificateParser, that TIKA didn't classified PEM and DER files as "x-x509-ca-cert". But now it do. I have created in CertificateParser "application/x-pem-file" and "application/pkix-cert" mime-types to identify this kind of content, but now it seems it can use the new "application/x-x509-ca-cert" identified by Tika. I will create an issue with this same info. |
Anyway, I will finish reviewing this PR. If CategoriesToExpand.txt is configured to expand "Android Applications" category, it will create a new subitem with the certificate data. Tika is classifying this new subitem as application/x-x509-cert. When CertificateParser is reviewed later, it will parse this accordingly. |
Thank you @patrickdalla for your analysis! |
PR. The user may config it in his profile if is his desire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After internationalization code and the creation of specific category for APK, it is good.
Thanks @patrickdalla for your review. @tc-wleite is @patrickdalla approach fine to you? |
Sure, it seems fine! I can run a last test before merging, but I will be able to do that in the afternoon. |
That would be great! But please enjoy the holiday, this is not urgent. |
Hi @lfcnassif! |
Thank you @tc-wleite for your adjustments and for testing again! Before merging, I would like to do 2 comments:
|
It was. but I reverted to put changes related to CertificateParser in the
other PR, and forgot to revert this inclusion on MakePreview. You can
revert if you want to merge this PR still today, or tomorrow I will do.
Em qua., 15 de nov. de 2023 20:35, Luis Filipe Nassif <
***@***.***> escreveu:
… Thank you @tc-wleite <https://github.com/tc-wleite> for your adjustments
and for testing again! Before merging, I would like to do 2 comments:
- Maybe changes in CertificateParser would fit better in #1981
<#1981>, but they look fine;
- @patrickdalla <https://github.com/patrickdalla> is the addition of
application/x-pem-file to MakePreviewConfig.txt intentional here?
—
Reply to this email directly, view it on GitHub
<#1956 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AG247S7LO3RWSSMA556RVW3YEVNVVAVCNFSM6AAAAAA6UGUY7CVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMJTGUYTQMZYGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Thanks @patrickdalla. I also removed the changes in CertificateParser, since most of them are included in #1981 to avoid possible merge conflicts. |
Closes #1953.
I managed to overcome the main limitations of the library used to parse the APKs.
There are a couple of TODO's related to certificates dates metadata that can be solved during review.
Processing 1K APKs (~12 GB) now took about 1 minute, while it was taking ~25 minutes before (with the default PackageParser). Index had 2.5 GB before, and now 20 MB.
A sample output of the parser: